Felix Wichmann: Computational models of vision: From early vision to deep convolutional neural networks
Talk by Felix Wichmann of the University of Tübingen, given at the Redwood Center for Theoretical Neuroscience at UC Berkeley.
Early visual processing has been studied extensively over the last decades. From these studies, a relatively standard model of the first steps in visual processing has emerged. However, most implementations of the standard model cannot take arbitrary images as input; they accept only the typical grating stimuli used in many of the early vision experiments.
I will present an image-based early vision model implementing our knowledge about early visual processing, including oriented spatial frequency channels, divisive normalization and optimal decoding. The model explains the classical psychophysical data reasonably well, matching the performance of the non-image-based models for contrast detection, contrast discrimination and oblique masking data. Leveraging the advantage of an image-based model, I show how well our model performs for detecting Gabors masked by patches of natural scenes. Finally, we observe that our model units are extremely sparsely activated: each natural image patch activates few units and each unit is activated by few stimuli.
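To make the divisive normalization step concrete, here is a minimal sketch of one common textbook form, in which each channel's response is divided by the pooled activity of all channels plus a constant. It is not the model presented in the talk: the function name, the exponent, the semi-saturation constant and the choice of pooling equally over all channels are assumptions made purely for illustration.

```python
def divisive_normalization(channel_outputs, exponent=2.0, semi_saturation=0.1):
    """Divide each channel's powered output by the pooled activity of all channels.

    Illustrative assumption, not the talk's model: a single pool over all
    channels with equal weights, and the same exponent in numerator and pool.
    """
    powered = [abs(x) ** exponent for x in channel_outputs]
    # The semi-saturation term keeps the denominator positive for blank input.
    pool = semi_saturation ** exponent + sum(powered)
    return [p / pool for p in powered]


if __name__ == "__main__":
    # Hypothetical outputs of three oriented spatial frequency channels.
    print(divisive_normalization([0.05, 0.40, 0.10]))
```

In this form the response grows with contrast at low contrasts and saturates below 1 at high contrasts, and activity in one channel suppresses the responses of the others, which is the qualitative behaviour usually invoked to account for contrast discrimination and masking data.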
In computer vision, recent and rapid advances in convolutional deep neural networks (DNNs) have resulted in image-based computational models of object recognition which, for the first time, rival human performance. However, although DNNs have undoubtedly proven their usefulness in computer vision, their usefulness as models of human vision is not yet equally clear. On the one hand, a growing number of studies find similarities between DNNs trained on object recognition and properties of the monkey or human visual system. On the other hand, there are well-known discrepancies, as illustrated by so-called adversarial examples. Given our knowledge of early visual processing, these discrepancies may already originate in differences in the processing of low-level features. To test this hypothesis, we performed object identification experiments with DNNs and human observers on exactly the same images under conditions favouring single-fixation, purely feed-forward processing. Whilst we clearly find certain similarities, we also find strikingly non-human behaviour in DNNs, as well as marked differences between different DNNs despite similar overall object recognition performance. I will discuss possible reasons for our findings in the light of our knowledge of early visual processing in human observers.